Pattern Matching Using n-gram Sampling Of Cumulative Algebraic Signatures : Preliminary Results

Authors

  • Witold Litwin
  • Riad Mokadem
  • Philippe Rigaux
  • Thomas Schwarz
Abstract

Extended abstract. We propose a novel string (pattern) matching algorithm called n-gram search. We intend it for records that are stored once and searched many times in a database or a file, especially one organized as a Scalable Distributed Data Structure (SDDS) over a grid or a structured P2P network. We presume that the records are encoded as their cumulative algebraic signatures, which provides incidental confidentiality of the stored data. The search starts by pre-processing the pattern: it calculates the logarithmic algebraic signature (LAS) of the pattern and the LAS of every n-gram in it. The value of n ≥ 1 is a tunable parameter. The search then attempts to match the LASs of the n-grams in the pattern against dynamically calculated LASs sampled over n-grams in the records. A mismatch generates a shift of up to K-n symbols towards the next sample, where K is the pattern length. The whole process runs in parallel over the SDDS servers and does not require any local decoding. For an M-symbol record, an unsuccessful search, measured as the number of match attempts, costs O((M-K) / (K-n+1)). 2-grams should typically suffice, leading to O((M-K) / (K-1)). We show that the algorithm is particularly efficient for longer strings and records, e.g., e-documents or DNA data. Preliminary results show the n-gram search to be about (K-n+1) times faster than our previous algorithms and among the fastest known, e.g., probably often faster than Boyer-Moore.
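The sampling idea in the abstract can be illustrated with a minimal sketch. It makes several assumptions that are not from the paper: it compares plain substrings instead of logarithmic algebraic signatures, it uses a fixed sampling stride of K-n+1 (so that every K-symbol window contains exactly one sampled n-gram), and the function name `ngram_sampling_search` is hypothetical. It is meant only to show why the number of samples grows as (M-K) / (K-n+1), not to reproduce the authors' implementation.

```python
from collections import defaultdict


def ngram_sampling_search(record: str, pattern: str, n: int = 2) -> list[int]:
    """Return all start positions of `pattern` in `record`.

    Simplified illustration: plain n-gram strings stand in for the
    signatures (LASs) used in the paper (an assumption of this sketch).
    """
    K, M = len(pattern), len(record)
    if not 1 <= n <= K:
        raise ValueError("n must satisfy 1 <= n <= len(pattern)")

    # Pre-processing: map every n-gram of the pattern to its offsets.
    offsets = defaultdict(list)
    for i in range(K - n + 1):
        offsets[pattern[i:i + n]].append(i)

    # Sample one n-gram every K-n+1 symbols, starting at position K-n.
    # Any K-symbol window of the record then contains exactly one sample.
    stride = K - n + 1
    matches = []
    for j in range(K - n, M - n + 1, stride):
        sample = record[j:j + n]
        # A sample absent from the pattern rules out every alignment
        # covering it, so the search moves straight to the next sample.
        for i in offsets.get(sample, ()):
            start = j - i  # candidate alignment of the pattern
            if start + K <= M and record[start:start + K] == pattern:
                matches.append(start)
    return sorted(matches)


if __name__ == "__main__":
    print(ngram_sampling_search("abracadabra", "abra", n=2))
```

With K = 4 and n = 2, the sketch inspects only the 2-grams starting at positions 2, 5, and 8 of the 11-symbol record, and verifies a full alignment only when a sampled 2-gram also occurs in the pattern.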


Similar resources

Fast String Search Using n-Gram Sampling of Cumulative Algebraic Signatures : Preliminary Results

We propose a novel string (pattern) matching algorithm called n-gram search. Unlike the prominent algorithms, ours does not process the visited string (record) and the pattern directly. We intend it for the records stored once and searched many times in a database or a file, especially organized into a Scalable Distributed Data Structure, (SDDS), over a grid or a structured P2P net. Instead, it...


Computing sampling points in a semi-algebraic set defined by non-strict inequalities, application to Pattern-Matching Problems

We focus on the problem of computing sampling points in a semi-algebraic set defined by equations and non-strict inequalities. This problem is reduced to computing sampling points in several real algebraic varieties, represent these points by rational parametrizations and decide the sign of some polynomials at the real solutions of these parametrizations. We show how these tasks can be deduced ...


SigMatch: Fast and Scalable Multi-Pattern Matching

Multi-pattern matching involves matching a data item against a large database of “signature” patterns. Existing algorithms for multipattern matching do not scale well as the size of the signature database increases. In this paper, we present sigMatch – a fast, versatile, and scalable technique for multi-pattern signature matching. At its heart, sigMatch organizes the signature database into a (...


AS-Index: A Structure For String Search Using n-grams and Algebraic Signatures

AS-Index is a new index structure for exact string search in disk resident databases. It uses hashing, unlike known alternate structures, tree or trie based, and indexes every n-gram in the database. supported as well. The hash function relies on the algebraic signatures of the n-grams. Use of hashing provides for constant index access time for arbitrarily long patterns, unlike other structures...



Publication date: 2006